-
Notifications
You must be signed in to change notification settings - Fork 395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/efa: Disable resource management when user set FI_OPT_EFA_RNR_RETRY #10560
base: main
Are you sure you want to change the base?
Conversation
Changes LGTM, would u mind adding a unit test? |
It seems rdm_rnr_queue_resend test conflicts with your PR. This test assumes RM is enabled after setopt. We need to update this test accordingly |
If a user is explicitly asking for a retry count that means they also want to manage the resources themselves and the EFA provider should disable resource management to prevent implicit retries. Also remove the unused function. Signed-off-by: Jessie Yang <[email protected]>
if (ret) { | ||
FT_PRINTERR("fi_setopt", -ret); | ||
FT_PRINTERR("setenv", -ret); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think setenv works here because fi_param_get
happens at fi_getinfo where we already have a efa_env.rnr_retry set as 3.
Also, this is shared among the rnr_cq_read_error and queue_resend test. The former intends to turn off RM, the latter one intends to turn on RM. It's totally fine for rnr_cq_read_error to call fi_setopt, but not queue_resend.
So we should have a bool somewhere to call fi_setopt only for cq_read_error test. And make queue_resend_test setenv before initing any resources
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code again, Nah. rnr_retry is not really environment variable even though it is in efa_env
so setenv won't work at all.
If a user is explicitly asking for a retry count that means they also want to manage the resources themselves and the EFA provider should disable resource management to prevent implicit retries.
Also remove the unused function.